
Welcome to the orc-format documentation. Thanks for checking it out!

This Rust crate is a toolkit to read and deserialize ORC to your favourite in-memory format.

Below is an example of how to read a column from ORC into memory:

use std::fs::File;

use orc_format::{error::Error, read, read::Column};


fn get_column(path: &str, column: u32) -> Result<Column, Error> {
    // open the file, as expected. buffering this is not necessary - we
    // are very careful about the number of `read`s we perform.
    let mut f = File::open(path).expect("no file found");

    // read the file's metadata
    let metadata = read::read_metadata(&mut f)?;

    // the next step is to identify which stripe we want to read. Let's say it is the first one.
    let stripe = 0;

    // Each stripe has a footer - we need to read it to find the location of each column within the stripe.
    let stripe_footer = read::read_stripe_footer(&mut f, &metadata, stripe, &mut vec![])?;

    // Finally, we read the column into `Column`
    read::read_stripe_column(&mut f, &metadata, stripe, stripe_footer, column, vec![])
}
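
For context on why few reads are needed: per the ORC specification, the final byte of a file stores the length of the postscript, which sits immediately before that byte, so a reader starts from the end of the file. The following is a minimal, self-contained sketch of that first step only. `postscript_range` is a hypothetical helper written for illustration; it is not part of orc-format and it does not parse the postscript.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Returns the byte range `(start, end)` occupied by the postscript.
/// Hypothetical helper for illustration; not part of `orc_format`.
fn postscript_range<R: Read + Seek>(reader: &mut R) -> io::Result<(u64, u64)> {
    // Seek to the last byte; the returned position is `file_length - 1`.
    let end = reader.seek(SeekFrom::End(-1))?;

    // That last byte holds the length of the postscript.
    let mut len = [0u8; 1];
    reader.read_exact(&mut len)?;
    let length = u64::from(len[0]);

    // The postscript ends right before the length byte.
    let start = end.checked_sub(length).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidData, "postscript longer than file")
    })?;
    Ok((start, end))
}
```

Any `Read + Seek` source works here, for example a `File` or an in-memory `std::io::Cursor`.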

To deserialize the values of a column, use the decoders in read::decode. For example, the function below deserializes the “Present” stream into a Vec<bool>.

use orc_format::{error::Error, proto::stream::Kind, read::decode::BooleanIter, read::Column};

fn deserialize_present(column: &Column, scratch: &mut Vec<u8>) -> Result<Vec<bool>, Error> {
    let mut reader = column.get_stream(Kind::Present, std::mem::take(scratch))?;

    let mut validity = Vec::with_capacity(column.number_of_rows());
    BooleanIter::new(&mut reader, column.number_of_rows()).try_for_each(|item| {
        validity.push(item?);
        Result::<(), Error>::Ok(())
    })?;

    // hand the buffer back to the caller so that it can be reused
    *scratch = reader.into_inner();

    Ok(validity)
}
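
To make it clearer what a boolean decoder has to undo, here is a self-contained sketch of the encoding described in the ORC specification: booleans are packed eight per byte, most significant bit first, and the bytes are then byte run-length encoded. The functions `decode_byte_rle` and `decode_present` are hypothetical names for illustration and are not the crate's implementation; the sketch also assumes the stream has already been decompressed.

```rust
/// Expands ORC byte run-length encoding into raw bytes.
/// Returns `None` if the input is truncated.
fn decode_byte_rle(mut input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    while let Some((&header, rest)) = input.split_first() {
        if header < 0x80 {
            // Run: the next byte is repeated `header + 3` times.
            let (&value, rest) = rest.split_first()?;
            out.extend(std::iter::repeat(value).take(header as usize + 3));
            input = rest;
        } else {
            // Literals: the header is the negated count as a signed byte.
            let count = 256 - header as usize;
            if rest.len() < count {
                return None;
            }
            out.extend_from_slice(&rest[..count]);
            input = &rest[count..];
        }
    }
    Some(out)
}

/// Decodes the first `rows` booleans, most significant bit first.
fn decode_present(input: &[u8], rows: usize) -> Option<Vec<bool>> {
    let bytes = decode_byte_rle(input)?;
    if bytes.len() * 8 < rows {
        return None;
    }
    Some(
        (0..rows)
            .map(|i| (bytes[i / 8] & (0x80u8 >> (i % 8))) != 0)
            .collect(),
    )
}
```

With the examples from the specification, `[0x61, 0x00]` expands to one hundred zero bytes, and `[0xff, 0x80]` decodes to one `true` followed by seven `false` values.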

Check out the integration tests of the crate to find deserialization of other types such as floats, integers, strings and dictionaries.
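
Whichever type is decoded, the ORC format records values only for rows whose “Present” bit is set, so the decoded values still have to be lined up with the validity. The sketch below shows one way to do that in plain Rust; `apply_validity` is a hypothetical helper and not part of orc-format.

```rust
/// Pairs a validity mask with densely stored non-null values.
/// Returns `None` if the number of values does not match the number
/// of `true` entries in `validity`.
fn apply_validity<T>(validity: &[bool], values: Vec<T>) -> Option<Vec<Option<T>>> {
    let mut values = values.into_iter();
    let mut out = Vec::with_capacity(validity.len());
    for &is_valid in validity {
        if is_valid {
            // A valid row consumes the next stored value.
            out.push(Some(values.next()?));
        } else {
            // A null row has no stored value.
            out.push(None);
        }
    }
    // Leftover values mean the mask and the data disagree.
    if values.next().is_some() {
        return None;
    }
    Some(out)
}
```

Returning `None` on a length mismatch, rather than panicking, keeps the helper usable on untrusted files.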

Modules

error: Contains Error

read: APIs to read from ORC